Exploiting document graphs for inter sentence relation extraction

Background: Most previous relation extraction (RE) studies have focused on intra sentence relations and have ignored relations that span sentences, i.e. inter sentence relations. Such relations connect entities at the document level rather than as relational facts in a single sentence. Extracting facts that are expressed across sentences leads to some challenges and requires different approaches than those usually applied in recent intra sentence relation extraction. Despite recent results, there are still limitations to be overcome.
Results: We present a novel representation for a sequence of consecutive sentences, namely document subgraph, to extract inter sentence relations. Experiments on the BioCreative V Chemical-Disease Relation corpus demonstrate the advantages and robustness of our novel system to extract both intra- and inter sentence relations in biomedical literature abstracts. The experimental results are comparable to state-of-the-art approaches and show the potential by demonstrating the effectiveness of graphs, deep learning-based model, and other processing techniques. Experiments were also carried out to verify the rationality and impact of various additional information and model components.
Conclusions: Our proposed graph-based representation helps to extract ∼50% of inter sentence relations and boosts the model performance on both precision and recall compared to the baseline model.
Supplementary Information: The online version contains supplementary material available at (10.1186/s13326-022-00267-3).


Background
Relation extraction (RE) is the task of discovering semantic connections between entities [1]. RE is a vital intermediate step in a variety of natural language processing (NLP) and information extraction applications in the biomedical domain. Its applications range from precision medicine [2], adverse drug reaction identification [3,4], drug abuse event extraction [5] and major life event extraction [6,7] to building question answering systems [8,9] and clinical decision support systems [10].
*Correspondence: lhquynh@vnu.edu.vn. 1 Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam. Full list of author information is available at the end of the article.
Most previous RE studies followed the assumption that if two entities were related, they would belong to a single sentence, and therefore ignored relationships expressed across sentence boundaries [11][12][13][14][15]. That is, the task of RE aims to classify the semantic relationship between an entity pair e_1 and e_2 in a given sentence S into a predefined relation class, including 'not-relate'. However, relationships between entities are often expressed across sentence boundaries or otherwise require a broader context to disambiguate [16][17][18]. For example, 30% of relations in the BioCreative V Chemical-Disease Relation (BC5 CDR) dataset [19] are only expressed across sentence boundaries, such as in the following excerpt expressing complicated inter sentence relations.
In this excerpt, the chemical 'carbachol' is annotated with Chemical-induced Disease (CID) relations to four diseases: 'nausea', 'hypotension', 'bradycardia' and 'asystole'. All of them are inter sentence relations: 'carbachol' only appears in the title and Sentence 1, while 'nausea' and 'hypotension' appear in Sentence 2 and 'bradycardia' and 'asystole' only appear in Sentence 3. These problems are exacerbated by document-level (rather than sentence-level) annotation, which is very common in biological text [17].
Thus, the research community has gained an interest in devising methods that move beyond single sentences and extract semantic relations that span sentences. That is, the task of inter sentence RE aims to identify the semantic relationship between a pair of entity mentions e_1 and e_2 in a given document D that contains several sentences S_1, S_2, ..., S_n. The extraction of inter sentence relations is much more difficult than that of intra sentence relations [20]. In some datasets, the entities involved in an inter sentence relation are marked at specific locations (examples include the BB3 corpus [21]). The DocRed dataset [22] annotates the relations and entities together with their corresponding supporting sentences. The inter sentence relation extraction problem becomes much more difficult in datasets where a relation is annotated between entities at the document level rather than between specific mentions. Since several mentions of an entity appear at different locations in the text, it is difficult to locate which sentences contain the supporting evidence for a relation. This problem becomes more severe in the biomedical domain, since biomedical documents often contain sentences with longer and more complex structures than those in the general domain. Moreover, many relations are expressed implicitly. When working with multiple sentences, extracting valuable information and then understanding the contexts of entity pairs becomes much more difficult. There is a multitude of different relation types in the biomedical domain, and potentially any pair of entities in the document could be related. For example, although the BC5 CDR corpus is only annotated with CID relations, many pairs of entities can have therapeutic relations.
These characteristics lead to some challenges and require different approaches than those usually applied in intra sentence relation extraction. Despite some initial results, recent approaches to inter sentence RE still have limitations. The end-to-end model proposed in [23] partly addressed inter sentence relation classification by using a multi-pass sieve coreference resolution module. Its drawback is that it depends strongly on the appearance of antecedent and anaphor representations of entities in the text, whereas many inter sentence relations are not expressed through anaphora. Another approach processes consecutive sentences as longer sentences. Examples include a Support Vector Machine (SVM)-based model with a very rich feature set [24], a hybrid model of a convolutional neural network and maximum entropy (ME) [25], and a long short-term memory network (LSTM) and convolutional neural network model that learns document-level semantic representations [20]. Since inter sentence RE requires information from local, non-local, syntactic and semantic dependencies, several previous studies tried to build a representation for the whole document, such as biaffine Relation Attention Networks (BRANs) [17] and the labeled edge graph convolutional neural network model on a document-level graph [18].
The novel approach we present in this paper draws inspiration from related works that exploit consecutive sentences for inter sentence relation extraction. We construct document subgraphs to leverage both local and non-local information effectively. We then construct a deep neural architecture based on a shared-weight convolutional neural network (swCNN) with an improved attention mechanism to explore the information of multiple paths on the document subgraph. The experimental results on the BC5 CDR benchmark dataset show potential and are comparable to state-of-the-art approaches. The investigation of the impact of different components and information on the final performance provides insights showing that the graph-based representation, the swCNN model, the instance merging/weighting technique and distant supervision learning are useful. It also leads us to conclude that knowledge-based information, coreference information and the attention mechanism are still promising areas for future research.

Materials and methods
We present this section in four main parts: the overview of our evaluated dataset; the overall picture of the proposed architecture and three main components in detail; additional techniques to improve model performance; and experimental configuration.

Dataset
Our experiments were conducted on the BioCreative V Chemical-Disease Relation dataset [19]. This corpus contains a total of 1500 PubMed articles separated into three subsets of 500 each for training, development and test (details are shown in Table 1). The dataset is annotated with chemicals, diseases and chemical-induced disease relationships at the abstract level. Relations are annotated both within and across sentence boundaries. According to the data survey of BioCreative [26], about 30% of all instances are inter sentence relationships.
(ii) In order to represent an instance by a set of paths, we apply several advanced techniques for finding, merging and choosing the relevant paths between entity pairs. (iii) In the next step, the advanced attention mechanism and several types of linguistic information are applied to explore the information from the document subgraphs more effectively. (iv) Lastly, to exploit these enriched representations effectively, we develop a shared-weight Convolutional Neural Network model (swCNN).

Document subgraph construction
As noted above, two entities that participate in a relation may belong to different sentences. Dependency trees are often used to extract local dependencies of semantic relations in intra sentence relation extraction. However, such dependencies are not adequate for inter sentence RE, since different sentences have separate dependency trees that are not connected. For the same reason, the shortest dependency path alone cannot capture the dependencies of a relation that spans sentences.
To overcome these limitations, we construct a graph for consecutive sentences based on their dependency trees, called the document subgraph. In this graph, the nodes correspond to words and the edges represent the connections between them. We make two assumptions: (i) The two entities participating in a relation should not be too far apart (experimentally, they should be within five consecutive sentences). If two entities are too far apart, the method's effectiveness is reduced, or the pair may be ignored. (ii) The title of the abstract is a special sentence that is related to every sentence in the abstract in a certain manner. Because of this assumption, the title is always used together with the abstract sentences to generate each subgraph.
Creating a document subgraph is a three-step process:
Step 1: Generate the dependency tree for each sentence. All directed dependency labels are kept in the subgraphs as local dependency information.
Step 2: Merge the dependency trees of the sentences in each sliding window into a document subgraph.
The sliding window of size w indicates the number of consecutive sentences that we use to create the document subgraphs. w = 1 indicates a single sentence; with w = j, every j consecutive sentences are used to create a subgraph. Since two entity mentions can appear in different sentences, an unrestricted selection of text spans would risk generating many unexpected examples and lead to an explosion of the computing space (see 'Instance merging'). We therefore limit w to 5, i.e., all relations whose two entities are not within 5 consecutive sentences are ignored. After this phase, each abstract consists of several subgraphs.
Step 3: Create virtual edges for subgraphs. The dependency trees already provide local dependency information. In this step, we add new virtual edges using several kinds of additional information:
− NEXT-SENT edges connect the root nodes of the dependency trees of two consecutive sentences. They bring sequential non-local dependency information.
− TITLE edges are created between the dependency tree roots of the Title and the first sentence in the sliding window. They provide non-local dependency information.
− COREFERENCE edges link an anaphoric expression to its antecedent if identified by the multi-pass sieves coreference resolution method [23]. These edges show the semantic relation between terms. We divide this connection type into three specific types: (i) COREF-sent: anaphor and antecedent belong to two normal sentences, (ii) COREF-to-title: anaphor is in a normal sentence and antecedent is in the Title, (iii) COREF-from-title: anaphor is in the Title and antecedent is in a normal sentence.
− KB-CTD edges are created between the head nodes of two entities if they are annotated as having relation 'M' in the Comparative Toxicogenomics Database (CTD) 1. We call this knowledge-based information.
These virtual edges are undirected and labeled by their names. We give a realistic example of the document subgraph in Additional file 1: Appendix A.
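The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the input format (a dict with a "root" and a list of (head, dependent, label) "edges"), the node identifiers and the function names `build_document_subgraph` and `sliding_windows` are hypothetical. Only NEXT-SENT and TITLE edges are added; COREFERENCE and KB-CTD edges would need external resources.

```python
from collections import defaultdict


def sliding_windows(sentence_trees, w):
    """Return every run of w consecutive sentences (one run if fewer than w)."""
    count = max(1, len(sentence_trees) - w + 1)
    return [sentence_trees[i:i + w] for i in range(count)]


def build_document_subgraph(title_tree, window_trees):
    """Merge the title tree and the trees of one window into a labeled graph.

    Assumed input format: each tree is {"root": node_id, "edges": [(head,
    dependent, label), ...]} with node ids unique across sentences.
    The graph maps a node to a set of (neighbor, label, direction) triples,
    where direction is ">" (head to dependent), "<" (dependent to head)
    or "-" (undirected virtual edge).
    """
    graph = defaultdict(set)

    # Steps 1 and 2: keep the directed dependency edges of every tree.
    for tree in [title_tree] + list(window_trees):
        for head, dependent, label in tree["edges"]:
            graph[head].add((dependent, label, ">"))
            graph[dependent].add((head, label, "<"))

    def add_virtual_edge(u, v, label):
        graph[u].add((v, label, "-"))
        graph[v].add((u, label, "-"))

    # Step 3: NEXT-SENT edges link the roots of consecutive sentences.
    for previous, following in zip(window_trees, window_trees[1:]):
        add_virtual_edge(previous["root"], following["root"], "NEXT-SENT")

    # Step 3: a TITLE edge links the title root to the first sentence root.
    if window_trees:
        add_virtual_edge(title_tree["root"], window_trees[0]["root"], "TITLE")
    return graph
```

Calling `build_document_subgraph` once per window returned by `sliding_windows` yields the several subgraphs per abstract mentioned in Step 2.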
Using the subgraphs already constructed, this module finds all possible paths between two entities in each graph by performing a breadth-first search. The graph we construct is quite complex, and its complexity increases with the sliding window size w and the number of virtual edges. A breadth-first traversal of such a large graph with cycles is resource-consuming (even if we never return to visited nodes, which avoids infinite loops).
To overcome this risk, we use two thresholds:
− Maximum depth md: the maximum number of nodes traversed from the starting node.
− Maximum number of paths k: the maximum number of paths that we collect.
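A minimal sketch of such a bounded breadth-first path search is given below. It assumes a simplified graph that maps each node to a list of neighbors and interprets md as the maximum number of nodes on a path; the function name `find_paths` is hypothetical, and the default values 15 and 150 are the ones reported in the experimental configuration.

```python
from collections import deque


def find_paths(graph, source, target, max_depth=15, max_paths=150):
    """Enumerate simple paths from source to target in breadth-first order.

    A path never revisits a node, is not extended once it holds max_depth
    nodes, and the search stops after max_paths paths have been collected.
    """
    paths = []
    queue = deque([[source]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) >= max_depth:
            continue  # depth threshold reached, do not extend this path
        for neighbor in graph.get(node, ()):
            if neighbor not in path:  # never go back to a visited node
                queue.append(path + [neighbor])
    return paths
```

Because the queue is processed in first-in, first-out order, shorter paths are collected before longer ones, so the path limit discards the longest candidates first.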
Nearly all previous studies in relation extraction consider co-occurring entity pairs with known relations as positive instances for training. This assumption is reasonable for intra sentence relations, but the inter sentence problem presents a new challenge, since this strategy risks generating too many wrong examples: the fact that a document expresses a relation between two entities does not mean that every text span containing these entities expresses that relation. Quirk and Poon [16] tackled this problem for the case where an entity pair co-occurs in a large text span and also co-occurs in a smaller text span that overlaps with the larger one. In such cases, if there is a relation between the pair, it is most likely expressed in the smaller text span, where the entities are closer to each other. To reduce the noise from large text spans, we apply a restriction on path generation called 'minimal span' [16], i.e., only the minimal span is used to generate the paths between two entities. A co-occurring entity pair has the minimal span if there does not exist another overlapping co-occurrence of the same pair. Since each abstract can have several subgraphs, in this phase we obtain several sets of paths. Figure 2 illustrates the instance merging technique. First, we address two unexpected problems that arise while generating instances from the document subgraph. In Fig. 2-A, a pair of entities appears several times at different positions in an abstract. Because the BC5 CDR corpus has relations annotated at the abstract level, all of these co-occurrences are treated as positive examples of the CID relation. In fact, only a few of them actually express the CID relation. This may cause much noise during training.

Instance merging
The example in Fig. 2-B shows the problem of unexpected instance repetition, which is especially likely when we widen the window used to create subgraphs. In this example, we can generate three identical training instances, i.e., the training patterns of this instance are produced three times, changing the actual frequency of the representation in the training data. This may lead the model to give this instance a higher priority (a larger weight). We give a realistic example of these problems below:
"<Title> Hemolysis of human erythrocytes induced by tamoxifen is related to disruption of membrane structure. . . . <S1> TAM induces hemolysis of erythrocytes as a function of concentration. <S2> The extension of hemolysis is variable with erythrocyte samples, but 12.5 microM TAM induces total hemolysis of all tested suspensions. <S3> Despite inducing extensive erythrocyte lysis, TAM does not shift the osmotic fragility curves of erythrocytes. <S4> The hemolytic effect of TAM is prevented by low concentrations of alpha-tocopherol (alpha-T) and alpha-tocopherol acetate (alpha-TAc) (inactivated functional hydroxyl) indicating that TAM-induced hemolysis is not related to oxidative membrane damage."
Given a title and 5 sentences as shown above and a sliding window size w = 3, we have 42 valid pairs of CID: TAM-Hemolysis. Each entity pair can potentially be described by up to 15 paths. As a result, if each pair CID: TAM-Hemolysis is considered a positive instance, we may have too many 'similar' positive instances. The same problem also appears for negative instances. To solve this problem, we propose a technique called instance merging, in which we extract all possible dependency paths between a pair of entity mentions and merge them into a single set for this entity pair. To reduce overlapping training instances, we remove repeated paths (i.e., if several paths are identical, only one is kept).
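The merging and de-duplication described above can be illustrated with a short sketch. The input format (triples of chemical identifier, disease identifier and path) and the function name `merge_instances` are assumptions made for this example only.

```python
from collections import defaultdict


def merge_instances(occurrences):
    """Group paths by entity pair and keep one copy of each identical path.

    occurrences: iterable of (chemical_id, disease_id, path) triples, where
    path is a sequence of tokens and dependency labels (assumed format).
    Returns {(chemical_id, disease_id): [unique paths in first-seen order]}.
    """
    merged = defaultdict(list)
    seen = defaultdict(set)
    for chemical, disease, path in occurrences:
        key = (chemical, disease)
        frozen = tuple(path)
        if frozen not in seen[key]:  # identical paths are kept only once
            seen[key].add(frozen)
            merged[key].append(frozen)
    return dict(merged)
```

With this representation, the many co-occurrences of a pair such as TAM and hemolysis collapse into a single training instance whose content is a set of distinct paths.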

Choosing top-k paths
After the instance merging phase, we have a set of several paths to represent a pair of entities. Some of them are useful, but others may be noise.
Prior works on intra sentence relation extraction often explored the single shortest path between two entities [27,28]. Applying these traditional approaches to the inter sentence relation classification problem raises several problems. Firstly, we cannot take advantage of all the local and global features, since they may appear in different paths; secondly, the shortest path may not be the 'best' path.
In contrast to these previous approaches, we propose to consider a set of multiple paths as a novel representation for an entity pair. To reduce noise and model complexity, we only choose the top-k best paths. This leads to the problem of how to choose advantageous paths. In this work, we implement two strategies to choose the top-k paths:
− Top-k shortest dependency paths; this strategy was also used by [16].
− Top-k paths with the highest number of repetitions.
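Both selection strategies can be expressed compactly. The sketch below is an illustration under assumptions: paths are sequences of tokens and dependency labels, path length is measured as the number of elements, and the function names are hypothetical. The default k = 3 is the value used in the experiments.

```python
from collections import Counter


def top_k_shortest(paths, k=3):
    """Keep the k shortest distinct paths (ties keep first-seen order)."""
    unique = list(dict.fromkeys(tuple(p) for p in paths))
    return sorted(unique, key=len)[:k]


def top_k_most_repeated(paths, k=3):
    """Keep the k distinct paths that were generated most often."""
    counts = Counter(tuple(p) for p in paths)
    return [path for path, _ in counts.most_common(k)]
```

Note that the repetition-based strategy needs the path list before identical paths are removed, since the repetition counts are the ranking signal.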
To explore the information in this novel representation, we cannot use our previous models. Instead, a new deep learning architecture capable of simultaneously processing multiple paths was proposed, based on the swCNN.

Path representation
Before inputting to the model, each component in the dependency paths must be transformed into an embedding vector. In order to have an informative representation, we take advantage of various linguistic information along the dependency path, from the original dependency tree and other resources.
Dependency relations with directions have proven more effective than dependency relations without directions for the relation extraction task [27]. However, treating a dependency relation and its opposite direction as two separate relations can induce two vectors for the same relation. We therefore represent each dependency relation with two discrete components: d_typ ∈ R^{dim_typ} represents the dependency relation type among 72 labels, and d_dir ∈ R^{dim_dir} represents the direction of the dependency relation, i.e., left-to-right or vice versa on the Shortest Dependency Path (SDP). The final representation d_i of a dependency relation is obtained by applying a nonlinear transformation, with trainable parameters W_d and b_d, to d_typ and d_dir. These two vectors are generated by looking up the embedding matrices W^e_typ ∈ R^{dim_typ × 72} and W^e_dir ∈ R^{dim_dir × 2}, respectively.
For token representation, we utilize two types of embeddings to represent the word information in different aspects:
− Pre-trained fastText embeddings [29] learn the word representation based on its external context and n-gram sub-word information. Each token in the input paths is transformed into a vector t^w_i by looking up the embedding matrix W^e_w ∈ R^{dim_we × |V|}, where dim_we is the word embedding dimension and V is the vocabulary of all words we consider.
− POS tag embeddings capture (dis)similarities between grammatical properties of words and their syntactic structural roles within a sentence. We concatenate the part-of-speech (POS) tag information into the token representation vector. We randomly initialize the embedding matrix W^e_p ∈ R^{dim_pe × 56} for the 56 POS tags of the OntoNotes 5.0 version of the Penn Treebank tag set. Each POS tag label is then represented as a corresponding vector t^p_i.
We concatenate the two embedding vectors of each token and transform them into the final token embedding t_i. Each token t_i is then concatenated with the corresponding attentive augmented information from its child nodes on the original dependency tree, as proposed by Can et al. [30]. Given a token t, the attentive augmented information is calculated using the token itself and the set of its M child nodes. Word embedding and POS tag embedding are concatenated to form the token embedding vector t, while the dependency relation from the direct ancestor is added to form a child node representation c_i. A position embedding d_i is also used to reflect the relative distance from the i-th child to its parent in the original sentence.
Two sequential attention layers on the children of a token are used to produce children context vectors. First, a simple self-attentive network is applied to the child nodes, where the attention weights are calculated from the concatenation of each child with the parent information and the distance from the parent. Here, w_d ∈ R^{dim_d} is the base distance embedding, and W_e and b_e are the weight and bias terms.
A distance-based heuristic attentive layer is then applied to the self-attentive children context vector to keep track of how close each child is to the target token, using the heuristically chosen weighting function f(d) = βd^2 with β = −0.03. Afterward, to capture the relevant and essential information from the output of the multi-attention layer and preserve the integrity of the word information, K kernel filters are applied to each child's attentive vector to produce K features from each child. The final augmented information a is captured by a max-pooling layer, where W_f is the weight of the K kernel filters and b_f is the bias term.
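To give a feel for the weighting function f(d) = βd^2 with β = −0.03, the sketch below turns child-to-parent distances into normalized weights. The softmax normalization and the function name `heuristic_weights` are assumptions made for this illustration; the text above only specifies f(d) itself.

```python
import math

BETA = -0.03  # value of β quoted in the text


def heuristic_weights(distances):
    """Map relative distances to weights that decay with squared distance.

    Assumption: the scores f(d) = BETA * d**2 are normalized with a softmax.
    """
    scores = [BETA * d * d for d in distances]
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the distance is squared, a child three positions to the left of its parent receives the same weight as one three positions to the right, and nearby children dominate distant ones.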
Finally, this concatenation is transformed into an X-dimensional vector to form the representation x_i ∈ R^X of the token, where W_x and b_x are the trainable parameters of this transformation.

Shared-weight convolutional neural network
Convolutional Neural Networks (CNNs) [31] are good at capturing n-gram features in a flat structure and have also proved effective in many natural language processing tasks, including relation classification [14,17]. The typical structure of a shared-weight CNN (swCNN) is quite similar to the original CNN, comprising convolution, pooling, fully connected layers and softmax. The novel point is the ability to share weights between several convolutions, which makes it possible to process multiple data instances at once. Figure 3 illustrates the overall architecture of our swCNN model, which is comprised of two main components: multi-path representation and classification. Given a set of k paths as input, each path is converted into a separate embedding matrix. A shared-weight convolution layer with ReLU activation follows, capturing convolved features from these embedding matrices simultaneously. The essential features are gathered using a filter-wise pooling layer before being classified by a fully connected layer with softmax classification.
In the embeddings layer, each component in the dependency path (i.e., token or dependency relation) is represented by a d-dimensional vector w_e ∈ R^d, where d is the desired number of embedding dimensions as described in the previous section 'Path representation'. After the embeddings layer, the input paths are transformed into a set of embedding matrices, one per path. In general, let us define the vector x_{i,j:j+m} as the concatenation of m tokens and the m−1 dependency relations between them. In the convolution layer, we apply N filters with region size r to these embedding matrices simultaneously. These filters move by dependency unit to keep the dependency information between tokens. Since the same filters are used for all matrices, our model can extract information from them at the same time and avoids increasing the number of weight parameters, which reduces the computational complexity. The filter-wise pooling step reduces all outputs of a filter to a single element by choosing the essential feature from all CNN features. This architecture helps swCNN to use the information on multiple paths simultaneously and, from there, to select the truly outstanding features. The convolutional layer computes each element f_p of the convolved feature vector f using W_c ∈ R^{(rX+(r−1)D)×N} and b_c ∈ R^k, the weight matrix and bias vector of the convolutional layer. At the classification phase, the number of features equals the number of filters used. They are then flattened into a feature vector and put through the softmax to decide the final prediction. That is, the output f of the convolutional layer is fed to a softmax classifier to predict a (K + 1)-class distribution over labels ŷ, where W_y and b_y are the parameters of the network to be learned. The proposed model can be stated as a parameter tuple θ = (W, b).
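The interplay between the shared filters and the filter-wise pooling can be sketched in plain Python. This is a simplified illustration rather than the authors' TensorFlow model: every path is assumed to be a list of equal-length unit vectors, each filter is a flat list of weights with one bias term, and the function name `shared_weight_conv_pool` is hypothetical.

```python
def shared_weight_conv_pool(paths, filters, biases, region=2):
    """Apply the same filters to every path, then pool per filter.

    paths:   list of paths, each a list of unit vectors of length dim.
    filters: N flat weight lists, each of length region * dim.
    biases:  N bias terms, one per filter.
    Returns N features: for each filter, the maximum ReLU response over
    all windows of all paths (filter-wise max pooling).
    """
    features = [0.0] * len(filters)  # ReLU responses are never below zero
    for path in paths:
        for start in range(len(path) - region + 1):
            # Concatenate `region` consecutive units into one window vector.
            window = [x for unit in path[start:start + region] for x in unit]
            for n, (weights, bias) in enumerate(zip(filters, biases)):
                total = sum(w * x for w, x in zip(weights, window)) + bias
                features[n] = max(features[n], max(0.0, total))
    return features
```

The number of parameters depends only on the number of filters and the window width, not on how many paths are supplied, which is the practical benefit of sharing weights.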
To compute the model parameters θ, we define a regularized training objective for each data sample, where y ∈ {0, 1}^{K+1} is the one-hot vector representing the ground truth and λ is a regularization coefficient.

Additional techniques

Ensemble mechanism
Overfitting is one of the most notable problems of deep learning models. It happens when the neural network is very good at learning its training set but cannot generalize beyond it (known as the generalization problem). The ensemble method [32] is one of the most effective paradigms to reduce variance; it helps to avoid overfitting and improves the stability and accuracy of the model. Moreover, random initialization has been demonstrated to have an impact on the model's performance on unseen data, i.e., individual trained model instances may perform substantially better (or worse) than the averaged results [17,28,33]. An ensemble mechanism was found to reduce variability whilst yielding better performance than the averaging mechanism [17].
In this paper, we use a strict majority vote, a simple but effective ensemble method that has been successfully

Distant supervision learning
Distant supervision learning has proved its positive impact on relation classification by utilizing a knowledge base in several studies [17,23,24]. In this work, we continue to apply distant supervision learning to the proposed subgraph models. In order to take advantage of the available resources, we do not rebuild the distant data ourselves. Instead, we use the CTD-Pfizer dataset [34], which has been successfully applied in [17,24]. Since this data does not contain entity annotations, we used the DNorm [35] and tmChem [36] tools to annotate the entities. This dataset contains 18,410 documents with 33,224 CID pairs (15,439 unique).

Experimental configuration and model's hyper parameters
Our model was implemented using Python version 3.5 and TensorFlow v1.15.0 2 . The dependency tree is generated using spaCy 3 . To generate the document subgraph, we set the maximum depth md = 15 and the maximum number of paths k = 150 for the breadth-first search algorithm of the pathfinding phase. Widening w beyond 5 may introduce considerable noise and cause a computational burden. Therefore, we limit the size of the sliding window w to at most 5, i.e., we exclude all entity pairs that are more than 5 consecutive sentences apart. Heuristically, we choose the top-k paths with k = 3 for each entity pair.
The shared-weight CNN employs the Adam optimizer [37] and uses Glorot random uniform initialization [38]. The mini-batch training size is set to 128. Surveying the data revealed an undesirable consequence of the subgraph representation: an unexpected increase in negative data. For the intra sentence problem, the ratio of positive to negative instances is about 1 : 2. With subgraphs, this ratio is 1 : 2.95, 1 : 3.53, 1 : 3.85, 1 : 4.05 and 1 : 4.20 for window sizes 1, 2, 3, 4 and 5, respectively (note that the title is always connected to the first sentence in the sliding window). This leads to an imbalanced data problem, which may negatively influence system performance through a bias towards the negative label. To minimize the impact of this problem, we assign class weights that give priority to the minority (positive) class. At this time, we cannot learn these weights automatically. Therefore, we set them heuristically to 3 : 1 for positive : negative.
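One common way to apply such class weights is to scale each sample's cross-entropy loss by the weight of its gold label. The sketch below assumes this scheme; how the weights enter the authors' objective is not stated in the text, and the names `CLASS_WEIGHTS` and `weighted_cross_entropy` are hypothetical.

```python
import math

# Heuristic 3 : 1 weighting for positive : negative, as stated in the text.
CLASS_WEIGHTS = {"positive": 3.0, "negative": 1.0}


def weighted_cross_entropy(prob_of_gold, gold_label):
    """Cross-entropy of one sample, scaled by the weight of its gold class.

    prob_of_gold is the predicted probability of the gold label (0 < p <= 1).
    """
    return -CLASS_WEIGHTS[gold_label] * math.log(prob_of_gold)
```

Under this scheme, a misclassified positive pair costs three times as much as an equally misclassified negative pair, which counteracts the roughly 1 : 3 to 1 : 4 class ratios reported above.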
We fine-tuned our deep learning model using training and development subsets (as described in Table 1). The optimized model's hyper-parameters in detail are shown in Table 2. For the final results, we use these configurations to run the training process 100 times and report the average results of 100 runs. The training time for each run is about 17.5 hours. The prediction time for the BC5 test set using the trained model is about 2 minutes.
We also apply several techniques to overcome overfitting, including max-norm regularization for gradient descent [11]; adding Gaussian noise [13] with a mean of 0.001 to the input embeddings; applying dropout [39] at 0.5 after all embedding layers and CNN layers; and using the early stopping technique [40].

Results
We present this section in four main parts: the contribution of the proposed virtual edges; the effect of subgraph window sizes; the ablation test results for the model components; and the comparison between our results and other state-of-the-art models.

Effect of the injected virtual edges in the document subgraph
We study the contribution of injecting virtual edges on the system performance by ablating each of them in turn from the graph and afterward evaluating the model with the sliding window size w = 2 and top-3 shortest paths for each entity pair (k = 3). We compare these experimental results by the changes of Precision (P), Recall (R) and F1-measure in Table 3 and Fig. 4.
This experiment presents an interesting view of the contribution of each type of virtual edge in the document subgraph. When NEXT-SENT is removed from the graph, the results decrease in terms of Precision, Recall and F1. The same happens when we remove TITLE.
In addition, although COREF-sent, COREF-to-title and KB-CTD help to find some more correct relations, they bring too many false-positive results and lead to worse Precision (removing them boosts Precision but gives slightly lower Recall).
Using the COREF-from-title connection also reduces F1, but in this case because it heavily degrades Recall whilst contributing only minimally to Precision.
These experimental results raise a challenge: if we want to use information from coreference and knowledge bases, we need additional methods to increase the quality of the information obtained. We leave this problem for further work. Therefore, in the next experiments, we only use the two connections NEXT-SENT and TITLE.

Effect of different sliding window size w for training and testing
We describe the change in the model's performance with different sizes of the sliding window in Fig. 5. A larger w helps to increase Recall but leads to worse Precision. This result is easy to explain: with a larger w we get more paths, but also more noise. The equilibrium point of Precision and Recall gives the highest F1 at w = 2, where Precision = 61.25%, Recall = 61.26% and F1 = 61.25%. More importantly, this observation also suggests a way to take advantage of a large w while minimizing its impact on Precision: using different window sizes for training and testing. A larger window size for training helps to collect new patterns in the text. A smaller window size for testing helps to reduce noise and narrow the allowed distance between two entities. To test this idea, we conducted grid search experiments with k = 3; the results are shown in Table 4.
The results have verified the effectiveness of the proposed ideas. With the larger w for training size, we have better Recall but worse Precision. For each training window size, the smaller w for testing always brings better F1 than the larger w. The best F1 archived with w = 5 for training and w = 2 for testing, increase 1.34% compared to the best results of using the same window size for training and testing.
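The idea of separate training and testing windows can be sketched as follows. The sketch assumes that a window of size w admits an entity pair when the two mentions lie within w consecutive sentences, i.e. when their sentence indices differ by less than w; this definition, the function candidate_pairs and the toy mentions are illustrative assumptions rather than details taken from the paper.

```python
from itertools import product


def candidate_pairs(chemicals, diseases, w):
    """Keep (chemical, disease) pairs whose sentences fit in a window of w
    consecutive sentences, i.e. |sentence gap| < w (an assumed definition)."""
    return [
        (c_name, d_name)
        for (c_name, c_sent), (d_name, d_sent) in product(chemicals, diseases)
        if abs(c_sent - d_sent) < w
    ]


# Hypothetical mentions: (entity name, sentence index in the abstract).
chemicals = [("drugA", 0), ("drugB", 3)]
diseases = [("diseaseX", 1), ("diseaseY", 4)]

train_pairs = candidate_pairs(chemicals, diseases, w=5)  # wide: more patterns
test_pairs = candidate_pairs(chemicals, diseases, w=2)   # narrow: less noise
```

With these toy mentions the wide training window keeps all four pairs, while the narrow testing window keeps only the two pairs whose mentions are in adjacent sentences.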

Contribution of the model components
We further investigate the contribution of each component in Table 5, which shows changes in F1 when ablating each component from the proposed model.
The F1 reductions illustrate that every proposed component contributes to the final result. However, the level of contribution varies among the components. The document subgraph proves its value by boosting F1 by 6.49%, with Recall increasing by 10.73%. Both TITLE and NEXT-SENT connections have a significant influence on model performance. Interestingly, TITLE edges seem to play the leading role: eliminating them reduces F1 by 5.47%. NEXT-SENT information also plays an essential role, since removing it reduces F1 by 2.60%. Our proposed instance merging technique also contributes significantly: without it, F1 decreases by 3.22%. The shared-weight CNN on top-k paths shows its positive influence by boosting F1 by 1.84%. An alternative method for choosing the top-k paths (by their repetition frequency instead of the shortest length) seems unsuitable, since it leads to a slight reduction in F1. As discussed above, using the same w for training and testing, rather than different values, also reduces F1. Adding class weight and attention

Comparison to existing models
We compare the performance of our model against nine competitors. The first three models can predict intra sentence relations only; the next six models are able to extract inter sentence relations:  [20].
− Biaffine Relation Attention Network (BRAN) takes advantage of the state-of-the-art attention tool Transformer [17].
− The labeled edge graph convolutional neural network model on a document-level graph [18]. The graph is constructed using various inter- and intra sentence dependencies to capture local and non-local dependency information.
Table 6 summarizes the performance of our model and some comparative models. The results of the comparative models are reported both with and without additional enhancements.
Our model yields very competitive results compared with other state-of-the-art models that take inter sentence relations into account. Compared with the original models without any additional enhancements, our model gives the best result, with 62.88%.
Applying distant supervision learning and ensemble technique, our model still achieves the best result among  We also show the detailed results for intra- and inter sentence relation extraction in Table 7. When evaluating intra sentence relation extraction we exclude all inter sentence relations, and vice versa.
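The separate intra- and inter sentence evaluation can be sketched as below. The relation identifiers, the is_intra lookup and both function names are hypothetical and serve only to illustrate the filtering step; the paper's actual evaluation script is not described in this section.

```python
def prf(gold, predicted):
    """Precision, recall and F1 over sets of relation identifiers."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def split_eval(gold, predicted, is_intra):
    """Score intra and inter sentence relations separately.

    is_intra maps each relation to True when it counts as an intra sentence
    relation (hypothetical lookup supplied by the caller).
    """
    scores = {}
    for label, keep in (("intra", True), ("inter", False)):
        gold_part = {r for r in gold if is_intra[r] == keep}
        pred_part = {r for r in predicted if is_intra[r] == keep}
        scores[label] = prf(gold_part, pred_part)
    return scores


# Toy data: relation ids are (document, chemical, disease) triples.
gold = {("d1", "c1", "x1"), ("d1", "c2", "x2"), ("d2", "c3", "x3")}
predicted = {("d1", "c1", "x1"), ("d1", "c2", "x9"), ("d2", "c3", "x3")}
is_intra = {
    ("d1", "c1", "x1"): True,
    ("d1", "c2", "x2"): False,
    ("d1", "c2", "x9"): False,
    ("d2", "c3", "x3"): True,
}
scores = split_eval(gold, predicted, is_intra)
```

Each subset is scored only against gold relations of the same kind, so a missed inter sentence relation does not lower the intra sentence Recall.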

Error analysis
We studied model outputs to analyze system errors and improvements, as shown in Table 8. For further analysis, we use the output of RbSP, an advanced intra sentence relation extraction model [30], for comparison; its results are shown in the column 'Comparative model'. The full versions of the abstracts used in Table 8 are given in Additional file 2: Appendix B. The former part (Examples #1 − 6) shows the effect of the graph-based model on intra sentence relations. It helps find some more intra sentence relations (Examples #1 − 2), since the graph-based representation adds many useful patterns for training. However, it also introduces new noise (Examples #3 − 4), i.e., some examples are correctly labeled by the comparative model but wrongly by the graph-based model. Examples #5 − 6 are errors that are not improved.
The latter part (Examples #7 − 10) focuses on inter sentence relation extraction; these relations account for about 30% of the instances in the BC5 CDR corpus and cannot be extracted by the intra sentence model. Example #7 shows an improvement, as the graph model extracts the inter sentence relation correctly. When it produces a false-positive result (Example #8), the graph-based model is penalized for turning a true negative into a false positive. Moreover, the graph model still misses many cases (Examples #9 − 10).
These errors can be attributed to the limitations of our model, including: (a) Many errors seem attributable to the parser. Example #9 is a case in which we cannot generate any dependency path between the two participating entities. A comprehensive analysis shows that our document subgraph representation with w = 2 covers only ∼93% of the instances in the test data (98% of intra sentence relations and 87% of inter sentence relations); in the remaining cases, we cannot generate any path between the two entities. (b) The information in the path may still be insufficient or redundant for making the correct prediction. (c) The graph-based representation brings much noise. New virtual edge also  (FN). Finally, we found some errors caused by imperfect gold annotation (a missing gold relation or a false gold relation). Example #11 shows a case in which our model finds a correct relation that the gold standard annotation does not include. Other annotation errors (Example #12) come from the hierarchical annotation convention. The BC5 CDR corpus only annotates relations between the most specific entities, i.e., it excludes relations that involve entities more general than other entities already participating in the CID relation of each abstract [26].
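To make the "most specific entity" convention concrete, the sketch below drops a predicted pair when the same chemical is already related to a more specific disease. The parent map, the concept names and the function drop_general_relations are hypothetical, and only the disease side of the hierarchy is handled; this is an illustration of the annotation rule, not a step the paper reports using.

```python
def drop_general_relations(relations, parent):
    """Remove (chemical, disease) pairs whose disease is an ancestor of
    another disease already related to the same chemical."""

    def ancestors(concept):
        found = set()
        # Walk up the hierarchy; the membership check guards against cycles.
        while concept in parent and parent[concept] not in found:
            concept = parent[concept]
            found.add(concept)
        return found

    kept = set()
    for chemical, disease in relations:
        has_more_specific = any(
            disease in ancestors(other_disease)
            for other_chemical, other_disease in relations
            if other_chemical == chemical and other_disease != disease
        )
        if not has_more_specific:
            kept.add((chemical, disease))
    return kept


# Hypothetical hierarchy: child concept -> more general parent concept.
parent = {"hepatitis": "liver disease"}
relations = {
    ("drugA", "hepatitis"),
    ("drugA", "liver disease"),
    ("drugB", "liver disease"),
}
filtered = drop_general_relations(relations, parent)
```

Here the pair (drugA, liver disease) is removed because drugA is already related to the more specific hepatitis, whereas (drugB, liver disease) is kept because no more specific disease is related to drugB.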

Discussion
In this work, we present a novel representation for a sequence of adjacent sentences in a document (namely, the document subgraph). The graph is constructed using various types of information to capture local and non-local features. Knowledge-base information is also used to bring manually curated, real-world information into the model.
We also propose an instance merging mechanism and the use of a set of multiple paths to represent the relationship between an entity pair. Our proposed model outperforms all comparative models in experiments on the BC5 CDR corpus without using external knowledge resources or additional enhancements. Comparing full-model performance, our model still achieves results comparable to the current state-of-the-art model (Verga's BRAN model) [17].
Compared with related work, the highlight of our proposed model is the use of document graphs with different train-test window sizes. To the best of our knowledge, most other studies look for relations within one or several consecutive sentences [20,24,25,28,42]. Our model addresses the problem of extracting relations in the whole document. This idea is similar to the study of Verga et al. [17], but they use the attention mechanism to find important information in the text. Instead, we extract the information on the graph in a linguistically motivated manner.
From the perspective of real-world use, graph building and model training are time-consuming, but they can be done offline. Processing new data is not fast enough for big data, but the model can extract relations from small and medium datasets in reasonable time. Another problem when applying the model is processing full text. According to our survey of the data, the abstract contains the basic information of the article. Nevertheless, full text needs closer investigation because the characteristics of full text and abstracts are quite different. For example, with full-text processing, a window size of 5 may not be enough, since two related entities may be very far apart. Extracting relations from full text will need some extra processing steps. We leave these problems for future work.
We also investigated the results in detail to identify our limitations for future improvement.
• Firstly, coreference and discourse resolution should be analyzed carefully to find a more suitable and effective way to apply them.
• Secondly, the valuable information from knowledge bases needs to be used more judiciously instead of being integrated directly into the graphs.
• Thirdly, our model's results depend heavily on the performance of the dependency parser. As a result, we must deal with many cascading errors from the processing step. We plan to use another parser that is built specifically for the biomedical domain.
• Lastly, the ensemble mechanism should be improved to achieve better results. However, running the graph-based models many times is quite time-consuming; this approach needs to be adapted to suit the graph-based model better.

Conclusions
In this paper, we present a novel representation for a sequence of consecutive sentences in a document (namely, the document subgraph). The graph is constructed using various types of information to capture local and non-local features. We also propose an instance merging mechanism and use a set of multiple paths to represent the relationship between entity pairs. To explore the information in the document subgraph, we construct a deep neural architecture based on a shared-weight convolutional neural network. An interesting finding is that not all types of new edges in the graph are useful for inter sentence relation extraction. Only the title-sentence connections and the connections between consecutive sentences are useful. In addition, all components and techniques applied in the proposed model contribute to the performance, at different levels.
In experiments on the BioCreative V CDR corpus, without using any external knowledge resources or additional enhancements, our proposed model outperforms all comparative models. We also investigated the results in detail to identify our limitations for future improvement. The experimental results and error analysis help us to prioritize future work.